cdrh3 sequence
Attribution assignment for deep-generative sequence models enables interpretability analysis using positive-only data
Frank, Robert, Widrich, Michael, Akbar, Rahmad, Klambauer, Günter, Sandve, Geir Kjetil, Robert, Philippe A., Greiff, Victor
Generative machine learning models offer a powerful framework for therapeutic design by efficiently exploring large spaces of biological sequences enriched for desirable properties. Unlike supervised learning methods, which require both positive and negative labeled data, generative models such as LSTMs can be trained solely on positively labeled sequences, for example, high-affinity antibodies. This is particularly advantageous in biological settings where negative data are scarce, unreliable, or biologically ill-defined. However, the lack of attribution methods for generative models has hindered the ability to extract interpretable biological insights from such models. To address this gap, we developed Generative Attribution Metric Analysis (GAMA), an attribution method for autoregressive generative models based on Integrated Gradients. We assessed GAMA using synthetic datasets with known ground truths to characterize its statistical behavior and validate its ability to recover biologically relevant features. We further demonstrated the utility of GAMA by applying it to experimental antibody-antigen binding data. GAMA enables model interpretability and the validation of generative sequence design strategies without the need for negative training data.
BetterBodies: Reinforcement Learning guided Diffusion for Antibody Sequence Design
Vogt, Yannick, Naouar, Mehdi, Kalweit, Maria, Miething, Christoph Cornelius, Duyster, Justus, Boedecker, Joschka, Kalweit, Gabriel
However, the discovery of therapeutic antibodies through traditional wet lab methods is expensive and time-consuming. The use of generative models in designing antibodies therefore holds great promise, as it can reduce the time and resources required. Recently, the class of diffusion models has gained considerable traction for their ability to synthesize diverse and high-quality samples. In their basic form, however, they lack mechanisms to optimize for specific properties, such as binding affinity to an antigen. In contrast, the class of offline Reinforcement Learning (RL) methods has demonstrated strong performance in navigating large search spaces, including scenarios where frequent real-world interaction, such as interaction with a wet lab, is impractical. Our novel method, BetterBodies, which combines Variational Autoencoders (VAEs) with RL guided latent diffusion, is able to generate novel sets of antibody CDRH3 sequences from different data distributions. Furthermore, we reflect biophysical properties in the VAE latent space using a contrastive loss and add a novel Q-function based filtering to enhance the affinity of generated sequences. In conclusion, methods such as ours have the potential to have great implications for real-world biological sequence design, where the generation of novel high-affinity binders is a cost-intensive endeavor. Antibodies are a class of proteins with great potential for treating diseases such as cancer (Kaplon et al., 2023; Norman et al., 2020; Robert et al., 2022).
Stable Online and Offline Reinforcement Learning for Antibody CDRH3 Design
Vogt, Yannick, Naouar, Mehdi, Kalweit, Maria, Miething, Christoph Cornelius, Duyster, Justus, Mertelsmann, Roland, Kalweit, Gabriel, Boedecker, Joschka
The field of antibody-based therapeutics has grown significantly in recent years, with targeted antibodies emerging as a potentially effective approach to personalized therapies. Such therapies could be particularly beneficial for complex, highly individual diseases such as cancer. However, progress in this field is often constrained by the extensive search space of amino acid sequences that form the foundation of antibody design. In this study, we introduce a novel reinforcement learning method specifically tailored to address the unique challenges of this domain. We demonstrate that our method can learn the design of high-affinity antibodies against multiple targets in silico, utilizing either online interaction or offline datasets. To the best of our knowledge, our approach is the first of its kind and outperforms existing methods on all tested antigens in the Absolut!